
GCP Batch jobs => mv: failed to close #6080


Open
nick-youngblut opened this issue May 14, 2025 · 2 comments

@nick-youngblut
Contributor

Bug report

An example output for my failed GCP Batch jobs:

2025-05-14 11:08:52.610 PDT
mv: preserving times for '/mnt/disks/arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7/./Solo.out/GeneFull_ExonOverIntron/raw.h5ad': No space left on device
2025-05-14 11:08:52.610 PDT
mv: failed to close '/mnt/disks/arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7/./Solo.out/GeneFull_ExonOverIntron/raw.h5ad': No space left on device
2025-05-14 11:08:52.835 PDT
mv: failed to close '/mnt/disks/arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7/./Solo.out/GeneFull_ExonOverIntron/Summary.csv': No space left on device
2025-05-14 11:08:53.027 PDT
mv: preserving times for '/mnt/disks/arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7/./Solo.out/GeneFull_Ex50pAS/raw.h5ad': No space left on device
2025-05-14 11:08:53.027 PDT
mv: failed to close '/mnt/disks/arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7/./Solo.out/GeneFull_Ex50pAS/raw.h5ad': No space left on device
2025-05-14 11:08:53.287 PDT
mv: failed to close '/mnt/disks/arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7/./Solo.out/Gene/Summary.csv': No space left on device
2025-05-14 11:08:53.746 PDT
mv: failed to close '/mnt/disks/arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7/./versions.yml': No space left on device
2025-05-14 11:08:58.815 PDT
Task task/nf-13150368-174723-238fac2a-1e92-4c510-group0-0/0/0 runnable 0 exited with status 0
2025-05-14 11:08:58.815 PDT
Task task/nf-13150368-174723-238fac2a-1e92-4c510-group0-0/0/0 background runnables all exited on their own.
2025-05-14 11:08:58.815 PDT
Task task/nf-13150368-174723-238fac2a-1e92-4c510-group0-0/0/0 succeeded

I'm using:

process {
    shell         = ['/bin/bash', '-euo', 'pipefail']
}

... so I don't see why the failing mv commands (No space left on device) do not cause the job to fail.
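As a partial workaround for surfacing the problem inside the task itself, a disk-space guard could be appended to the end of the script block so that -e trips while the user script is still running. This is only a sketch, assuming the outputs land on the filesystem backing the task's working/scratch directory and that a 5 GB free-space floor is a sensible threshold (inside a Nextflow script block the dollar signs would need to be escaped as \$):

# Hypothetical guard: fail the task if the filesystem backing the
# current directory has less than ~5 GB free, so the error is raised
# where `set -euo pipefail` still applies
free_kb=$(df --output=avail -k . | tail -n 1)
if [ "${free_kb}" -lt $((5 * 1024 * 1024)) ]; then
    echo "ERROR: less than 5 GB free on the task disk" >&2
    df -h .
    exit 75
fi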

I'm guessing that the No space errors are due to a lack of disk space. My process:

process STAR_map {
    label "STAR_env"
    cpus Math.min(params.cpus_max, 20)
    memory { 40.GB * (1 + 0.5 * task.attempt) }
    time  { 4.h + (4 * task.attempt).h }
    disk { 
        def read1_size_gb = fastq_read1.size() / 1024 ** 3
        def read2_size_gb = fastq_read2.size() / 1024 ** 3
        def size_gb = (read1_size_gb + read2_size_gb) * 10
        def ssd_count = Math.max(1, Math.ceil(size_gb / 375)).intValue()
        println "${meta.id} => size_gb: ${size_gb.round(1)}; ssd_count: ${ssd_count}"
        [request: (375 * ssd_count * task.attempt).GB, type: "local-ssd"]
    }

    input:
    tuple val(meta), path(fastq_read1), path(fastq_read2), path(genome_dir)
    each path(star_par_file)

    output:
    path "*", emit: all // all files/folders will be saved into $outdir/STAR directory
    tuple val(meta), path("${meta.id}_Aligned.toTranscriptome.out.bam"), emit: trbam       // Transcritome alignments will be passed to Salmon (or other transcript quantification)
    tuple val(meta), path("${meta.id}_Aligned.sortedByCoord.out.bam"),   emit: bam         // Genome alignments, sorted by coordinate by STAR
    tuple val(meta), path("${meta.id}_ReadsPerGene.out.tab"),            emit: reads_per_gene   // for multiqc
    tuple val(meta), path("Log.final.out"),                              emit: log_final     // STAR log file
    tuple val(meta), path("Solo.out"),                                   emit: solo        // Solo output directory
    path "versions.yml",                                                 emit: versions

    script:
    """
    STAR ${params.extra_pars_star} \\
        --runThreadN ${task.cpus} \\
        --parametersFiles ${star_par_file} \\
        --genomeDir ${genome_dir} \\
        --readFilesIn ${fastq_read1} ${fastq_read2} \\
        --outSAMattrRGline ID:${meta.id}    SM:${meta.id}    PL:ILLUMINA \\
        --soloStrand ${meta.strandedness} \\
        2>&1 | tee ${task.process}_${meta.id}.log
        
    # remove temporary STAR files (sometimes they are not removed by STAR)
    rm -rf _STARtmp

    echo "# Compressing the output for the sake of scanpy" | tee -a ${task.process}_${meta.id}.log
    find Solo.out -type f -name "*.mtx" | xargs -P ${task.cpus} gzip
    find Solo.out -type f -name "*.tsv" | xargs -P ${task.cpus} gzip

    echo "# Converting solo output to h5ad" | tee -a ${task.process}_${meta.id}.log
    mtx-to-h5ad.py --output-dir Solo.out --sample "${meta.id}" Solo.out
    
    # rename Aligned.toTranscriptome.out.bam by adding the sample name
    mv Aligned.toTranscriptome.out.bam ${meta.id}_Aligned.toTranscriptome.out.bam
    mv Aligned.sortedByCoord.out.bam ${meta.id}_Aligned.sortedByCoord.out.bam
    mv ReadsPerGene.out.tab ${meta.id}_ReadsPerGene.out.tab

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
    star: \$(STAR --version | sed -e "s/STAR_//g")
    END_VERSIONS 
    """
}

...but the zero-exit GCP Batch jobs make the underlying problem hard to troubleshoot.
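For reference, here is roughly what the disk closure above computes on the first attempt for a hypothetical pair of 20 GB FASTQ files (standalone Groovy, with the sizes hard-coded instead of taken from fastq_read1/fastq_read2):

// Hypothetical worked example of the disk request calculation above
def read1_size_gb = 20.0    // assumed 20 GB FASTQ for read 1
def read2_size_gb = 20.0    // assumed 20 GB FASTQ for read 2
def size_gb   = (read1_size_gb + read2_size_gb) * 10               // 400 GB working estimate
def ssd_count = Math.max(1, Math.ceil(size_gb / 375)).intValue()   // 2 local SSDs
def attempt   = 1
println "request: ${375 * ssd_count * attempt} GB"                 // request: 750 GB

So even the first attempt requests 750 GB of local SSD for ~40 GB of input, well above the ~100 GB of task output reported by the du -sh checks mentioned in the comments below.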

Expected behavior and actual behavior

Expected: errors from commands in the task (such as the failing mv calls) cause the GCP Batch job to exit with a non-zero status, which Nextflow then treats as a task failure. Actual: the job exits with status 0, Nextflow cannot read an exit status, and the task is retried as having "terminated for an unknown reason".

Steps to reproduce the problem

See above

Program output

The relevant log:

~> TaskHandler[id: 5; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/98/a4db46edf26cb46d29cdc66fd4bcdc]
~> TaskHandler[id: 6; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/dd/ecf421747f747dba9031d294bbc5a4]
May-14 10:54:22.668 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `STAR_map (LB_Brain_28-29_FLEX)` - terminated job=nf-eb1af425-1747237817762; task=0; state=SUCCEEDED
May-14 10:54:23.220 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Cannot read exit status for task: `STAR_map (LB_Brain_28-29_FLEX)` - For input string: ""
May-14 10:54:23.221 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 3; name: STAR_map (LB_Brain_28-29_FLEX); status: COMPLETED; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/eb/1af4254a3364c13759c44d3f1f91bd]
May-14 10:54:23.222 [Task monitor] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'TaskFinalizer' minSize=10; maxSize=10; workQueue=LinkedBlockingQueue[-1]; allowCoreThreadTimeout=false
May-14 10:54:23.330 [TaskFinalizer-1] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=STAR_map (LB_Brain_28-29_FLEX); work-dir=gs://arc-genomics-nextflow/work/eb/1af4254a3364c13759c44d3f1f91bd
  error [nextflow.exception.ProcessFailedException]: Process `STAR_map (LB_Brain_28-29_FLEX)` terminated for an unknown reason -- Likely it has been terminated by the external system
May-14 10:54:23.344 [TaskFinalizer-1] INFO  nextflow.processor.TaskProcessor - [eb/1af425] NOTE: Process `STAR_map (LB_Brain_28-29_FLEX)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
May-14 10:54:25.142 [Task submitter] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `STAR_map (LB_Brain_28-29_FLEX)` submitted > job=nf-eb0eaa6a-1747245263940; uid=nf-eb0eaa6a-174724-28d58387-b7f0-44df0; work-dir=gs://arc-genomics-nextflow/work/eb/0eaa6a7855d6f04b37093ad82ab212
May-14 10:54:25.143 [Task submitter] INFO  nextflow.Session - [eb/0eaa6a] Re-submitted process > STAR_map (LB_Brain_28-29_FLEX)
May-14 10:55:12.115 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor google-batch > tasks to be completed: 6 -- submitted tasks are shown below
~> TaskHandler[id: 1; name: STAR_map (LB_Brain_10_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/56/6e3ab805acfa393943326fc1d31bd9]
~> TaskHandler[id: 2; name: STAR_map (LB_Brain_10_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7]
~> TaskHandler[id: 4; name: STAR_map (LB_Brain_28-29_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/ab/335dbd229a567686123b109c0f79f3]
~> TaskHandler[id: 5; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/98/a4db46edf26cb46d29cdc66fd4bcdc]
~> TaskHandler[id: 6; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/dd/ecf421747f747dba9031d294bbc5a4]
~> TaskHandler[id: 7; name: STAR_map (LB_Brain_28-29_FLEX); status: SUBMITTED; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/eb/0eaa6a7855d6f04b37093ad82ab212]
May-14 11:00:12.121 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor google-batch > tasks to be completed: 6 -- submitted tasks are shown below
~> TaskHandler[id: 1; name: STAR_map (LB_Brain_10_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/56/6e3ab805acfa393943326fc1d31bd9]
~> TaskHandler[id: 2; name: STAR_map (LB_Brain_10_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7]
~> TaskHandler[id: 4; name: STAR_map (LB_Brain_28-29_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/ab/335dbd229a567686123b109c0f79f3]
~> TaskHandler[id: 5; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/98/a4db46edf26cb46d29cdc66fd4bcdc]
~> TaskHandler[id: 6; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/dd/ecf421747f747dba9031d294bbc5a4]
~> TaskHandler[id: 7; name: STAR_map (LB_Brain_28-29_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/eb/0eaa6a7855d6f04b37093ad82ab212]
May-14 11:05:12.125 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor google-batch > tasks to be completed: 6 -- submitted tasks are shown below
~> TaskHandler[id: 1; name: STAR_map (LB_Brain_10_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/56/6e3ab805acfa393943326fc1d31bd9]
~> TaskHandler[id: 2; name: STAR_map (LB_Brain_10_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7]
~> TaskHandler[id: 4; name: STAR_map (LB_Brain_28-29_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/ab/335dbd229a567686123b109c0f79f3]
~> TaskHandler[id: 5; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/98/a4db46edf26cb46d29cdc66fd4bcdc]
~> TaskHandler[id: 6; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/dd/ecf421747f747dba9031d294bbc5a4]
~> TaskHandler[id: 7; name: STAR_map (LB_Brain_28-29_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/eb/0eaa6a7855d6f04b37093ad82ab212]
May-14 11:09:02.345 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `STAR_map (LB_Brain_10_FLEX)` - terminated job=nf-13150368-1747237814724; task=0; state=SUCCEEDED
May-14 11:09:02.931 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Cannot read exit status for task: `STAR_map (LB_Brain_10_FLEX)` - For input string: ""
May-14 11:09:02.932 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 2; name: STAR_map (LB_Brain_10_FLEX); status: COMPLETED; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7]
May-14 11:09:03.273 [TaskFinalizer-2] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=STAR_map (LB_Brain_10_FLEX); work-dir=gs://arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7
  error [nextflow.exception.ProcessFailedException]: Process `STAR_map (LB_Brain_10_FLEX)` terminated for an unknown reason -- Likely it has been terminated by the external system
May-14 11:09:03.274 [TaskFinalizer-2] INFO  nextflow.processor.TaskProcessor - [13/150368] NOTE: Process `STAR_map (LB_Brain_10_FLEX)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
May-14 11:09:05.609 [Task submitter] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `STAR_map (LB_Brain_10_FLEX)` submitted > job=nf-02949505-1747246143812; uid=nf-02949505-174724-9e82d505-a03e-45560; work-dir=gs://arc-genomics-nextflow/work/02/9495054c5685ca39bee0dd5caffd2b
May-14 11:09:05.610 [Task submitter] INFO  nextflow.Session - [02/949505] Re-submitted process > STAR_map (LB_Brain_10_FLEX)
May-14 11:10:12.139 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor google-batch > tasks to be completed: 6 -- submitted tasks are shown below
~> TaskHandler[id: 1; name: STAR_map (LB_Brain_10_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/56/6e3ab805acfa393943326fc1d31bd9]
~> TaskHandler[id: 4; name: STAR_map (LB_Brain_28-29_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/ab/335dbd229a567686123b109c0f79f3]
~> TaskHandler[id: 5; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/98/a4db46edf26cb46d29cdc66fd4bcdc]
~> TaskHandler[id: 6; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/dd/ecf421747f747dba9031d294bbc5a4]
~> TaskHandler[id: 7; name: STAR_map (LB_Brain_28-29_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/eb/0eaa6a7855d6f04b37093ad82ab212]
~> TaskHandler[id: 8; name: STAR_map (LB_Brain_10_FLEX); status: SUBMITTED; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/02/9495054c5685ca39bee0dd5caffd2b]
May-14 11:15:12.144 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor google-batch > tasks to be completed: 6 -- submitted tasks are shown below
~> TaskHandler[id: 1; name: STAR_map (LB_Brain_10_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/56/6e3ab805acfa393943326fc1d31bd9]
~> TaskHandler[id: 4; name: STAR_map (LB_Brain_28-29_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/ab/335dbd229a567686123b109c0f79f3]
~> TaskHandler[id: 5; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/98/a4db46edf26cb46d29cdc66fd4bcdc]
~> TaskHandler[id: 6; name: STAR_map (LB_Brain_32-33_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/dd/ecf421747f747dba9031d294bbc5a4]
~> TaskHandler[id: 7; name: STAR_map (LB_Brain_28-29_FLEX); status: RUNNING; exit: -; error: -; workDir: gs://arc-genomics-nextflow/work/eb/0eaa6a7855d6f04b37093ad82ab2

Environment

  • Nextflow version: 24.10.5.5935
  • Java version: openjdk 11.0.1 2018-10-1
  • Operating system: Linux
  • Bash version: 5.1.16

Additional context

See https://nfcore.slack.com/archives/C02T98A23U7/p1736629652539469?thread_ts=1667836041.736919&cid=C02T98A23U7 for more context

@nick-youngblut
Contributor Author

I should note that I am using scratch = true. My profile:

    gcp {
        process {
            executor      = "google-batch"
            errorStrategy = "retry"
            maxRetries    = 3
            scratch       = true
        }
        params {
            cpus_max      = 24
            memory_max    = "500.GB"
            time_max      = "72.h"
        }
    } 

I've included du -sh commands in my STAR_map process, and the total file size is ~100 GB after STAR mapping and subsequent steps; however, the job still throws mv: failed to close errors such as:

2025-05-14 11:08:53.287 PDT
mv: failed to close '/mnt/disks/arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7/./Solo.out/Gene/Summary.csv': No space left on device
2025-05-14 11:08:53.746 PDT
mv: failed to close '/mnt/disks/arc-genomics-nextflow/work/13/150368f33208ba6fef5ff688caf8c7/./versions.yml': No space left on device
2025-05-14 11:08:58.815 PDT

even when I provide 9 local SSDs (375 * 9 = 3375 GB).
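To narrow down which filesystem is actually running out of space, one option might be to log the mounted volumes from the process itself, for example via a beforeScript directive (just a sketch; beforeScript runs before the task script, so the boot disk, the local SSDs, and the shared work volume and their sizes should all show up in the task log):

process {
    withName: 'STAR_map' {
        // Hypothetical diagnostic: print all mounted filesystems and
        // their free space before the task script starts
        beforeScript = 'df -h'
    }
}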

@nick-youngblut
Contributor Author

nick-youngblut commented May 15, 2025

At least in some cases, increasing google.batch.bootDiskSize to 100-200 GB can fix the issue.
However, all GCP Batch jobs in the run will then use a large boot disk, even though only specific processes (e.g., STAR_map in my case) actually need it.
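For completeness, this is what the setting looks like in nextflow.config; note it is a google.batch scope option rather than a process directive, which is why it applies to every job in the run:

google {
    batch {
        // Larger boot disk for all GCP Batch jobs in this run;
        // as far as I can tell there is no per-process equivalent
        bootDiskSize = 200.GB
    }
}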
